import random, math
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
#plt.rc('text', usetex=True)
plt.rc('font', family='serif')
#plt.rc('text.latex', preamble='\\usepackage{libertine}\n\\usepackage[utf8]{inputenc}')
# numpy - pretty matrix
np.set_printoptions(precision=3, threshold=1000, edgeitems=5, linewidth=80, suppress=True)
import seaborn
seaborn.set(style='whitegrid'); seaborn.set_context('talk')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
# Fixed seeds to make the results replicable - remove in real life!
random.seed(42)
np.random.seed(42)  # also seed numpy, since np.random is used below
We can assume that $k=1$, that the noise is Gaussian, and that $\Psi$ is finite.
We will assume that the target variable is described by
$$y = \vec{F}(\vec{x},\vec{w}) + \varepsilon$$where:
The derivations provided below all assume Gaussian noise on the target data. Is this a good assumption? In many cases, yes. The argument hinges on the Central Limit Theorem, which basically says that the sum of many independent random variables behaves like a Gaussian-distributed random variable.
The noise term in this model, $\varepsilon$, can be thought of as the sum of the features not included in the model function $\vec{F}(\vec{x},\vec{w})$. If these unaccounted-for features are themselves independent random variables, and there are many of them, then the Central Limit Theorem suggests that a Gaussian model is appropriate.
It is possible that there are only a few unaccounted-for features, or that there is genuinely non-Gaussian noise in our measurements, e.g. sensor shot noise, which often follows a Poisson distribution. In such cases, the assumption is no longer valid.
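To make the Central Limit Theorem argument concrete, here is a minimal added sketch (not part of the original derivation): we sum many independent, non-Gaussian (uniform) variables and compare the histogram of the sums with a Gaussian density. It reuses the numpy/matplotlib imports from the setup cell; the number of "unaccounted-for features" is an assumption for illustration.
n_unmodelled = 50       # assumed number of unaccounted-for features
n_samples = 10000
sums = np.random.uniform(-1, 1, size=(n_samples, n_unmodelled)).sum(axis=1)
plt.hist(sums, bins=50, density=True, alpha=0.6, label='sum of %d uniforms' % n_unmodelled)
xs = np.linspace(sums.min(), sums.max(), 200)
sigma = np.sqrt(n_unmodelled / 3.0)   # Var[Uniform(-1, 1)] = 1/3, so the sum has variance n/3
plt.plot(xs, np.exp(-xs**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2), 'r-', label='Gaussian')
plt.xlabel('sum'); plt.legend(frameon=True);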
Training dataset
N = 5 # Size of the data set
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]]) # Inputs, shape: N x 1
y = np.array([10.5, 5.0, 3.0, 2.5, 1.0]) # Outputs, shape: N
X.T
array([[ 0., 1., 2., 3., 4.]])
y
array([ 10.5, 5. , 3. , 2.5, 1. ])
Test dataset
N_test = 100
X_test = np.linspace(0.0, 4.0, N_test).reshape((N_test, 1))
y_test = np.linspace(7.0, -5.0, N_test)+ 2*np.random.randn(N_test)
plt.plot(X[:, 0], y, 'bo', label='Training')
plt.plot(X_test[:, 0], y_test, "g.", label='Test')
plt.xlabel('$X$'); plt.ylabel('$y$'); plt.legend(frameon=True);
Formally, each observation is modelled as
$$y^{(n)} = \vec{w}^\intercal\vec{x}^{(n)} + b + \varepsilon^{(n)}\,,$$where $\vec{w}$ and $b$ constitute the weight vector and bias, and $\varepsilon^{(n)} \sim \mathcal{N}(0, \sigma^2_\varepsilon)$.
A shorter version of this expression is the following equation:
$$\vec{y} = \vec{X} \cdot \vec{w} + \vec{\varepsilon},\ \vec{X} \in \mathbb{R}^{m \times n},\ \vec{y} \in \mathbb{R}^m,\ \vec{\varepsilon} \in \mathbb{R}^m,\ \vec{w} \in \mathbb{R}^n,$$where the $i$-th row of $\vec{X}$ represents $\vec{x}^{(i)}$ and the $j$-th entry of the vector $\vec{y}$ represents $y^{(j)}$.
The least-squares problem is then to minimize $\left\|\vec{X} \cdot \vec{w} - \vec{y}\right\|^2_2$, where $\left\|\vec{A}\right\|_2$ is called the Frobenius norm and is the generalization of the Euclidean norm to matrices.
The frequentist (maximum likelihood) approach
Under the Gaussian noise assumption, the likelihood of the training data is $$ p(\vec{y}|\vec{X},\vec{w},\sigma^2) = \prod_{i=1}^N {\mathcal{N}\left(y^{(i)}\left|\vec{w}^\intercal\phi(\vec{x}^{(i)}),\sigma^2\right.\right)}\,, $$ and its logarithm is $$ \ln p(\vec{y}|\vec{X},\vec{w},\sigma^2) = \frac{N}{2}\ln\left(\frac{1}{\sigma^2}\right) -\frac{N}{2}\ln(2\pi) - \frac{1}{2\sigma^2}\sum_{i=1}^N\left\{y^{(i)} -\vec{w}^\intercal\phi(\vec{x}^{(i)})\right\}^2\,, $$ where $\vec{X}=\{\vec{x}^{(1)},\ldots,\vec{x}^{(N)}\}$ is the set of input values for the corresponding $N$ observed output values contained in the vector $\vec{y}$, and $\mathcal{N}(\mu,\sigma^2)$ is the Normal (Gaussian) distribution.
Taking the logarithm of the likelihood and setting its derivative with respect to $\vec{w}$ equal to zero, one obtains the maximum likelihood parameters given by the normal equations: $$\vec{w}_\text{ML} = \left(\vec{\Phi}^\intercal\vec{\Phi}\right)^{-1}\vec{\Phi}^\intercal\vec{y},$$ where $\vec{\Phi}$ is the $N \times M$ design matrix with elements $\Phi_{i,j}=\phi_j(\vec{x}^{(i)})$, and $\vec{y}$ is the $N \times K$ matrix of training set target values (for $K=1$, it is simply a column vector).
Note that $\vec{\Phi}^\intercal$ is an $M \times N$ matrix, so that $\vec{w}_\text{ML}=\left(\vec{\Phi}^\intercal\vec{\Phi}\right)^{-1}\vec{\Phi}^\intercal\vec{y}$ has shape $(M \times N)\times(N \times M)\times(M\times N)\times(N \times K) = M \times K$, where $M$ is the number of free parameters and $K$ is the number of predicted target values for a given input.
Note that the only term in the log-likelihood that depends on $\vec{w}$ is the last one. Thus, maximizing the likelihood with respect to $\vec{w}$ under the assumption of Gaussian noise is equivalent to minimizing a sum-of-squares error function.
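A quick numeric sanity check of this equivalence (an added sketch with a made-up one-parameter dataset, not from the original notebook): over a grid of candidate weights, the $w$ that maximizes the Gaussian log-likelihood is exactly the $w$ that minimizes the sum-of-squares error.
x_chk = np.array([0.0, 1.0, 2.0, 3.0])
t_chk = 2.0 * x_chk + np.array([0.1, -0.2, 0.15, 0.05])  # roughly t = 2x plus small noise
sigma2 = 1.0                                             # assumed (fixed) noise variance
w_grid = np.linspace(0.0, 4.0, 401)
sse = np.array([0.5 * np.sum((t_chk - w * x_chk) ** 2) for w in w_grid])
loglik = -0.5 * len(x_chk) * np.log(2 * np.pi * sigma2) - sse / sigma2
print('argmin SSE:', w_grid[np.argmin(sse)], ' argmax log-likelihood:', w_grid[np.argmax(loglik)])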
There are two ways to adjust the weights of a linear model (which includes linear models with nonlinear features): solving the normal equations directly, or updating the weights iteratively with gradient descent.
In addition, we can add a constraint to the objective function to penalize large weights: Tikhonov regularization (forward pointer).
Solving $\vec{X} \cdot \vec{w} = \vec{y}$ means solving $$ \left\{ \begin{array}{rcrcl} x^{(1)}_1 w_1 & + x^{(1)}_2 w_2 & + \cdots & + x^{(1)}_n w_n & = y^{(1)} \\ x^{(2)}_1 w_1 & + x^{(2)}_2 w_2 & + \cdots & + x^{(2)}_n w_n & = y^{(2)} \\ x^{(3)}_1 w_1 & + x^{(3)}_2 w_2 & + \cdots & + x^{(3)}_n w_n & = y^{(3)} \\ \vdots & \vdots & \ddots & \vdots & = \vdots \\ x^{(m)}_1 w_1 & + x^{(m)}_2 w_2 & + \cdots & + x^{(m)}_n w_n & = y^{(m)} \end{array} \right.\,. $$
Note: A system of equations is considered overdetermined if there are more equations than unknowns. An overdetermined system is almost always inconsistent (it has no solution) when constructed with random coefficients. However, an overdetermined system will have solutions in some cases, for example if some equation occurs several times in the system, or if some equations are linear combinations of the others.
That is, the solution of
$$\vec{X}^\intercal\vec{X} \hat{\boldsymbol{w}} = \vec{X}^\intercal\vec{y}$$for $\hat{\boldsymbol{w}}$, i.e.,
$$\hat{\vec{w}} = (\vec{X}^\intercal\vec{X})^{-1}\vec{X}^\intercal\vec{y}.$$The expression $(\vec{X}^\intercal\vec{X})^{-1}\vec{X}^\intercal$ is equivalent to the Moore-Penrose pseudoinverse $\vec{X}^+$ of $\vec{X}$. It is a generalization of the inverse for non-square matrices. You can use this to implement normal equations if your library provides the function. You could also use a least squares solver (e.g. numpy.linalg.lstsq).
See https://en.wikipedia.org/wiki/Linear_least_squares_(mathematics) for more details.
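A tiny illustration of the overdetermined case from the note above (an added sketch with made-up numbers): three equations in two unknowns have no exact solution here, but np.linalg.lstsq returns the least-squares solution, and the pseudoinverse gives the same answer.
A_demo = np.array([[1.0, 1.0],
                   [1.0, 2.0],
                   [1.0, 3.0]])   # 3 equations, 2 unknowns
b_demo = np.array([1.0, 2.0, 2.5])
w_ls, residuals, rank, sv = np.linalg.lstsq(A_demo, b_demo)
print('least-squares solution:', w_ls)
print('sum of squared residuals:', residuals)   # non-empty: the system has no exact solution
print('pseudoinverse solution:', np.linalg.pinv(A_demo).dot(b_demo))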
def normal_equations(X, y):
# w = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
# ... or we can use the Moore-Penrose pseudoinverse (is usually more likely to
# be numerically stable)
w = np.linalg.pinv(X).dot(y)
# ... or we use the solver
# w = np.linalg.lstsq(X, y)[0]
return w
Add bias to each row/instance:
X_bias = np.hstack((X, np.ones((N, 1))))
X_test_bias = np.hstack((X_test, np.ones((N_test, 1))))
X.T
array([[ 0., 1., 2., 3., 4.]])
X_bias.T
array([[ 0., 1., 2., 3., 4.], [ 1., 1., 1., 1., 1.]])
w_norm = normal_equations(X_bias, y)
plt.plot(X_bias[:, 0], y, "o")
plt.plot(X_bias[:, 0], X_bias.dot(w_norm), "m-", linewidth=2)
plt.xlabel("x"); plt.ylabel("y"); plt.title("Model: $y = %.2f x + %.2f$" % tuple(w_norm));
We can now solve linear regression problems at will!
Test the other forms of normal equations.
Think of this problem:
Assuming that each pixel is represented by three bytes defining the RGB values:
n_pixels = 256 * 256
memory_size_xtx = n_pixels * n_pixels * 3 * 8
print("Required memory: %d GB" % (memory_size_xtx / 2**30))
Required memory: 96 GB
Matrix inversion is a procedure that requires $O(n^{2.373})$ operations with the asymptotically fastest known algorithms (standard Gaussian elimination is $O(n^3)$).
In cases where the training data set is very large or data is received in a stream, a direct solution using the normal equations may not be possible. An alternative approach is the stochastic gradient descent algorithm.
Iteratively update the weights using the gradient as reference.
Each update $\Delta w_i$ is proportional to minus the partial derivative of the error function $E(\vec{X}, \vec{y};\vec{w})$ with respect to the weight $w_i$:
$$ \Delta w_i = - \alpha\frac{\partial E(\vec{X}, \vec{y};\vec{w})}{\partial w_i}\,, $$where:
Update $\vec{w}$ incrementally in every iteration $t$ as:
$$ \vec{w}(t+1) = \vec{w}(t) + \Delta \vec{w}(t)\,. $$
Using the sum of squared errors (SSE) as the error function,
$$E(\vec{X}, \vec{y};\vec{w}) = \frac{1}{2}\left\|\vec{X} \cdot \vec{w} - \vec{y}\right\|^2_2\,,$$
the gradient becomes
$$\nabla_{\vec{w}} E(\vec{X}, \vec{y};\vec{w}) = \vec{X}^\intercal \cdot (\vec{X} \cdot \vec{w} - \vec{y})\,.$$
def error(X, y, w):
    return 0.5*np.linalg.norm(X.dot(w) - y)**2
def linear_regression_gradient(X, y, w):
return X.T.dot(X.dot(w)-y)
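Before using it, we can sanity-check the analytic gradient against a central finite-difference approximation built from the error() function above. This is an added check, not part of the original notebook; numerical_gradient is a helper introduced here for illustration.
def numerical_gradient(X, y, w, eps=1e-6):
    'Central finite-difference approximation of dE/dw_i.'
    grad = np.zeros_like(w, dtype=np.float64)
    for i in range(len(w)):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (error(X, y, w_plus) - error(X, y, w_minus)) / (2 * eps)
    return grad

w_check = np.array([0.5, -1.0])
print('analytic: ', linear_regression_gradient(X_bias, y, w_check))
print('numerical:', numerical_gradient(X_bias, y, w_check))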
Initialize weights to small random values
w = np.random.random(2)/10000
alpha = 0.05
for i in range(50):
# print('Iteration', i+1, 'weights', w, 'error', error(X_bias, y, w))
w = w - alpha*linear_regression_gradient(X_bias, y, w)
print('w_norm:', w_norm, 'w grad. desc.', w)
w_norm: [-2.15 8.7 ] w grad. desc. [-2.089 8.526]
def plot_contour(X_data, y_data, bounds, resolution=50, cmap=cm.viridis,
alpha=0.3, linewidth=5, rstride=1, cstride=5, ax=None):
(minx,miny),(maxx,maxy) = bounds
x_range = np.linspace(minx, maxx, num=resolution)
y_range = np.linspace(miny, maxy, num=resolution)
X, Y = np.meshgrid(x_range, y_range)
Z = np.zeros((len(x_range), len(y_range)))
for i, w_i in enumerate(x_range):
for j, w_j in enumerate(y_range):
Z[j,i] = error(X_data, y_data, [w_i, w_j])
if not ax:
fig = plt.figure(figsize=(6,6))
ax = fig.gca()
ax.set_aspect('equal')
ax.autoscale(tight=True)
cset = ax.contourf(X, Y, Z, 30, cmap=cmap, rstride=rstride,
cstride=cstride, linewidth=linewidth, alpha=alpha)
cset = ax.contour(X, Y, Z, 10, cmap=cmap, rstride=rstride,
cstride=cstride, linewidth=linewidth)
plt.clabel(cset, inline=1, fontsize=7)
return Z
Implementing the gradient descent loop as a function.
def gradient_descent(X, y, w_0, alpha, max_iters):
'Returns the values of the weights as learning took place.'
w = np.array(w_0, dtype=np.float64)
w_hist = np.zeros(shape=(max_iters+1, w.shape[0]))
w_hist[0] = w
for i in range(0,max_iters):
        delta_weights = -alpha*linear_regression_gradient(X, y, w)  # use the X argument, not the global X_bias
w += delta_weights
w_hist[i+1] = w
return w_hist
w_0 = [-3,2] # we fix the initial weights to make it more illustrative
alpha = 0.05
max_iters = 25
w_hist = gradient_descent(X_bias, y, w_0, alpha, max_iters)
def plot_hist_contour(X_bias, y, w_hist, w_norm, ax=None, title=None, show_legend=False):
if not ax:
fig = plt.figure(figsize=(5,5))
ax = fig.gca()
combi=np.hstack((w_norm.reshape(2,1), w_hist.T))
bounds = (np.min(combi, axis=1)-2, np.max(combi, axis=1)+2)
plot_contour(X_bias, y, bounds, ax=ax)
ax.scatter(w_norm[0], w_norm[1], c='m', marker='D', s=50, label='$w_{norm}$')
ax.plot(w_hist[:,0], w_hist[:,1], '.:', c='b')
ax.scatter(w_hist[0,0], w_hist[0,1], c='navy', marker='o', s=65, label='start')
ax.scatter(w_hist[-1,0], w_hist[-1,1], c='navy', marker='s', s=50, label='end')
plt.xlabel('$w_1$'); plt.ylabel('$w_2$');
if title:
plt.title(title)
if show_legend:
plt.legend(scatterpoints=1, bbox_to_anchor=(1.37,1), frameon=True);
plot_hist_contour(X_bias, y, w_hist, w_norm, title='end='+str(w_hist[-1]), show_legend=True)
What is the impact of the learning rate ($\alpha$)?
def alphas_study(alphas):
fig = plt.figure(figsize=(11,7))
for i,alpha in enumerate(alphas):
ax = fig.add_subplot(2,3,i+1)
w_hist = gradient_descent(X_bias, y , w_0, alpha, max_iters)
plot_hist_contour(X_bias, y, w_hist, w_norm, ax=ax, title='$\\alpha='+str(alpha)+'$')
plt.legend(scatterpoints=1, ncol=3, bbox_to_anchor=(-0.2,-0.2), frameon=True);
plt.tight_layout()
alphas = np.linspace(0.02,0.07,6)
alphas
array([ 0.02, 0.03, 0.04, 0.05, 0.06, 0.07])
alphas_study(alphas)
Idea: adding an inertial momentum term to the process can:
Update $\vec{w}$ incrementally in every iteration $t$ as:
$$\vec{w}(t+1) = \vec{w}(t) + \Delta \vec{w}(t),\qquad \Delta \vec{w}(t) = -\alpha\,\nabla_{\vec{w}} E\bigl(\vec{X}, \vec{y};\vec{w}(t)\bigr) + \beta\, \Delta \vec{w}(t-1),$$where $\beta\in\mathbb{R}^+$ is known as the momentum rate.
def gradient_descent_with_momentum(X, y, w_0, alpha, beta, max_iters):
w = np.array(w_0, dtype=np.float64)
w_hist = np.zeros(shape=(max_iters+1, w.shape[0]))
w_hist[0] = w
omega = np.zeros_like(w)
for i in range(max_iters):
delta_weights = -alpha*linear_regression_gradient(X, y, w) + beta*omega
omega = delta_weights
w += delta_weights
w_hist[i+1] = w
return w_hist
alpha = 0.05
beta = 0.5
max_iters = 25
w_hist = gradient_descent(X_bias, y, (-3,2), alpha, max_iters)
w_hist_mom = gradient_descent_with_momentum(X_bias,y, (-3,2), alpha, beta, max_iters)
def comparison_plot():
fig = plt.figure(figsize=(9,4.5))
ax = fig.add_subplot(121)
plot_hist_contour(X_bias, y, w_hist, w_norm, ax=ax, title='Gradient descent')
ax = fig.add_subplot(122)
plot_hist_contour(X_bias, y, w_hist_mom, w_norm, ax=ax, title='Gradient descent with momentum', show_legend=True)
plt.tight_layout()
comparison_plot()
def alphas_study_with_momentum(alphas, beta):
fig = plt.figure(figsize=(11,7))
for i,alpha in enumerate(alphas):
ax = fig.add_subplot(2,3,i+1)
w_hist = gradient_descent_with_momentum(X_bias, y , w_0, alpha, beta, max_iters)
plot_hist_contour(X_bias, y, w_hist, w_norm, ax=ax, title='$\\alpha='+str(alpha)+'$')
plt.legend(scatterpoints=1, ncol=3, bbox_to_anchor=(-0.2,-0.2), frameon=True);
plt.tight_layout()
alphas_study_with_momentum(alphas, 0.5)
alphas_study_with_momentum(alphas, 0.25)
alphas_study_with_momentum(alphas, 0.75)
... I know you are wondering what $\beta \geq 1$ looks like.
alphas_study_with_momentum(alphas, 1.05)
plt.plot(X_bias[:, 0], y, "o", label='training data')
plt.plot(X_bias[:, 0], X_bias.dot(w_norm), "D-", c='lightcoral',
linewidth=2, label='normal equation')
plt.plot(X_bias[:, 0], X_bias.dot(w_hist[-1]), "s-", c='skyblue',
linewidth=2, label='gradient descent')
plt.plot(X_bias[:, 0], X_bias.dot(w_hist_mom[-1]), "^-", c='green',
linewidth=2, alpha=0.5, label='gradient descent\n with momentum')
plt.xlabel("$X$"); plt.ylabel("$y$"); plt.legend(numpoints=1, bbox_to_anchor=(1.45,1), frameon=True);
N_test = 10000000
X_test_load = np.linspace(0.0, 4.0, N_test).reshape((N_test, 1))
y_test_load = np.linspace(7.0, -5.0, N_test)+ 2*np.random.randn(N_test)
X_test_load_bias = np.hstack((X_test_load, np.ones((N_test, 1))))
%%time
w_norm = normal_equations(X_test_load_bias, y_test_load)
CPU times: user 788 ms, sys: 410 ms, total: 1.2 s Wall time: 1.07 s
%%time
w_grad = gradient_descent_with_momentum(X_test_load_bias, y_test_load,
np.random.random(2)/1000, 0.000001, 0.25, 25)
CPU times: user 2.62 s, sys: 753 ms, total: 3.37 s Wall time: 1.77 s
w_norm
array([-3. , 7.001])
w_grad[-1]
array([ -6.249e+43, -2.443e+43])
np.linalg.norm(w_norm - w_grad[-1])
6.7101402268503633e+43
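The gradient-descent run above diverges because the SSE gradient is a sum over all 10 million samples, so its magnitude (and the range of stable learning rates) grows with the dataset size. A minimal stochastic (minibatch) gradient descent sketch, the alternative mentioned earlier for very large or streaming data, is shown below; this is an added illustration with assumed hyperparameters, and it averages the gradient over each minibatch so the step size does not depend on the dataset size.
def minibatch_sgd(X, y, w_0, alpha, n_epochs=2, batch_size=256):
    'Minibatch SGD on the (averaged) sum-of-squares error.'
    w = np.array(w_0, dtype=np.float64)
    n = len(y)
    for epoch in range(n_epochs):
        order = np.random.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            # averaged minibatch gradient of the SSE
            grad = X[idx].T.dot(X[idx].dot(w) - y[idx]) / len(idx)
            w -= alpha * grad
    return w

w_sgd = minibatch_sgd(X_test_load_bias, y_test_load, np.random.random(2)/1000, alpha=0.05, n_epochs=1)
print('w_sgd:', w_sgd, ' w_norm:', w_norm)  # should be roughly in agreement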
To approximate nonlinear functions with a linear model, we have to generate nonlinear features.
In this example, we generate cosinusoidal features. You could also try radial basis functions, polynomials, ...
We expand each feature $x_j$ into the nonlinear feature vector
$$\left(\phi_1(x_j), \phi_2(x_j), \ldots, \phi_d(x_j)\right),$$
where $d$ is the number of basis functions.
from itertools import chain, cycle
def cosinusoidalize(X, n_degree):
X_cosinusoidal = np.ndarray((len(X), n_degree+1))
for d in range(n_degree+1):
X_cosinusoidal[:, d] = np.cos(X[:, 0] * np.pi * d / n_degree)
return X_cosinusoidal
# Utility function
def build_cosinusoidal(n_degree, w):
return ((("%+.2f \\cos(\\frac{%d \\pi x}{" + str(n_degree) + "} )") * (n_degree + 1)) % tuple(chain(*zip(w, range(n_degree + 1)))))
N_DEGREE = 4
X_cosinusoidal = cosinusoidalize(X, n_degree=N_DEGREE)
X_test_cosinusoidal = cosinusoidalize(X_test, n_degree=N_DEGREE)
X.T, X_cosinusoidal
(array([[ 0., 1., 2., 3., 4.]]), array([[ 1. , 1. , 1. , 1. , 1. ], [ 1. , 0.707, 0. , -0.707, -1. ], [ 1. , 0. , -1. , -0. , 1. ], [ 1. , -0.707, -0. , 0.707, -1. ], [ 1. , -1. , 1. , -1. , 1. ]]))
from matplotlib.lines import lineStyles
ls = cycle([style for style in lineStyles if not lineStyles[style] =='_draw_nothing'])
for deg in range(N_DEGREE):
plt.plot(np.linspace(0,4,100), X_test_cosinusoidal[:,deg],
linestyle=next(ls), label='Degree '+str(deg))
plt.legend(frameon=True, bbox_to_anchor=(1.,1))
plt.title('Cosine basis transformations'); plt.xlabel("$X$"); plt.ylabel("$y$");
w_cos = normal_equations(X_cosinusoidal, y)
plt.figure(figsize=(11,4.5))
plt.plot(X_bias[:, 0], X_bias.dot(w), "m-", label='linear')
plt.plot(X_test_bias[:, 0], X_test_cosinusoidal.dot(w_cos), "g:", label='cosinusoidal')
plt.plot(X_bias[:, 0], y, "bo", label='training')
plt.xlabel("$X$"); plt.ylabel("$y$"); plt.legend(frameon=True)
plt.title("$y = " + build_cosinusoidal(N_DEGREE, w_cos) + "$");
You might be wondering, is this cheating for $d$ = N_DEGREE = 4?
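As suggested above, radial basis functions (or polynomials) can be plugged in exactly the same way. Here is a minimal added sketch with Gaussian bumps, reusing normal_equations and the training data X, y from above; the helper rbf_features, the centers, and the width are assumptions made for this illustration.
def rbf_features(X, centers, width=1.0):
    'One Gaussian bump per center, evaluated at every input.'
    return np.exp(-(X[:, 0:1] - centers[np.newaxis, :]) ** 2 / (2 * width ** 2))

rbf_centers = np.linspace(0.0, 4.0, 5)   # assumed: 5 centers spread over the input range
X_rbf = rbf_features(X, rbf_centers)
X_test_rbf = rbf_features(X_test, rbf_centers)
w_rbf = normal_equations(X_rbf, y)
plt.plot(X[:, 0], y, 'bo', label='training')
plt.plot(X_test[:, 0], X_test_rbf.dot(w_rbf), 'r--', label='RBF fit')
plt.xlabel('$X$'); plt.ylabel('$y$'); plt.legend(frameon=True);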
Regularization comprises techniques used to account for overfitting or underfitting of a given model when using a maximum likelihood approach.
The error function to be minimized becomes $$E(\vec{w}) = E_D(\vec{w}) + \lambda E_W(\vec{w})\,,$$where $\lambda$ is the regularization coefficient, $E_D$ is the data-dependent (sum-of-squares) error, and $E_W$ is the regularization term.
A commonly used regularization term is the sum-of-squares of the model parameters,
$$E_W(\vec{w}) = \frac1{2}\vec{w}^\intercal\vec{w}\,,$$known as the weight decay regularizer.
This regularization term leads, assuming a linear regression model with Gaussian noise on the training data, to the optimal solution
$$ \vec{w} = \left(\lambda \vec{I} + \vec{\Phi}^\intercal \vec{\Phi}\right)^{-1} \vec{\Phi}^\intercal\vec{y}\,. $$
def y(x, m, b, mu=0, sigma=1):
    return m * x + b + np.random.normal(mu, sigma, 1)[0]
def el_pow(x, pow):
temp = x
for i in range(pow - 1):
temp = temp * x
return temp
def prediction(params, x):
pred = 0
for i in range(len(params)):
pred += params[i] * math.pow(x, i)
return pred
Generating training data, with $N$ data points
N = 101
M = 8
t = np.empty(N)
domain = np.empty(N)
domain_bound = 1.0 / N
for i in range(N):
domain[i] = i * domain_bound
for i in range(N):
t[i] = y(x=domain[i], m=4.89, b=0.57)
Find the solution without using regularization. The design matrix, $\phi$, has dimension $N\times M$.
phi = np.array([np.ones(N), domain, el_pow(domain,2), el_pow(domain,3),
el_pow(domain,4), el_pow(domain,5), el_pow(domain,6),el_pow(domain,7)]).T
temp1 = np.linalg.inv(np.dot(phi.T,phi)) #inverse of phi.T X phi
temp2 = np.dot(temp1, phi.T)
Solution without regularization:
w_nreg = np.dot(temp2,t)
w_nreg
array([ 0.194, 29.042, -317.676, 1720.279, -4590.23 , 6374.297, -4413.416, 1202.347])
predicted_t = [prediction(w_nreg,x) for x in domain]
Finding the regularized solution
lam = 0.1
temp1 = np.linalg.inv(lam * np.eye(M) + np.dot(phi.T, phi))
temp2 = np.dot(temp1, phi.T)
w_reg = np.dot(temp2, t)
w_reg
array([ 0.793, 3.763, 1.178, 0.412, 0.146, -0.088, -0.397, -0.772])
predicted_t_reg = [prediction(w_reg, x) for x in domain]
plt.plot(domain, t, '.', label='training')
plt.plot(domain, predicted_t, label='predicted')
plt.plot(domain, predicted_t_reg, '--', label='regularized')
plt.legend(loc='lower right', frameon=True); plt.xlabel("$X$"); plt.ylabel("$y$")
plt.title('Effects of regularization');
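As an optional cross-check (an added sketch, assuming scikit-learn is installed): sklearn.linear_model.Ridge solves the same penalized least-squares problem, so with fit_intercept=False (the constant column is already in phi) its coefficients should match w_reg.
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=lam, fit_intercept=False)
ridge.fit(phi, t)
print('sklearn Ridge:', ridge.coef_)   # should be close to w_reg
print('w_reg        :', w_reg)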
# install with pip install version_information
%load_ext version_information
%version_information scipy, numpy, matplotlib
Software | Version |
---|---|
Python | 3.6.2 64bit [GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] |
IPython | 6.1.0 |
OS | Darwin 17.0.0 x86_64 i386 64bit |
scipy | 0.19.1 |
numpy | 1.13.1 |
matplotlib | 2.0.2 |
Tue Aug 29 09:52:59 2017 -03 |
# this code is here for cosmetic reasons
from IPython.core.display import HTML
from urllib.request import urlopen
HTML(urlopen('https://raw.githubusercontent.com/lmarti/jupyter_custom/master/custom.include').read().decode('utf-8'))